The Performance Paradox states that a mathematically perfect kernel, such as $out = x + y$, can perform worse than a plain CPU loop if it fails to amortize the fixed costs of dispatching work to the GPU. This overhead often manifests as the Launch Tax.
1. The "Correctness" Fallacy
Functional correctness is not a proxy for efficiency. Your Triton code might correctly distribute work across thousands of threads, but if the total amount of work (N) is small, the GPU remains underutilized: the hardware spends more time on launch and scheduling overhead than on actual arithmetic.
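A toy occupancy model makes the underutilization concrete. The numbers below are assumptions for illustration (roughly A100-class: ~108 SMs, each wanting on the order of 2,048 resident threads to hide memory latency), not measured values; the point is only that a small N cannot fill the machine, no matter how correct the kernel is.

```python
# Toy occupancy model with assumed hardware numbers (illustrative only).
SMS = 108                 # streaming multiprocessors on the device
THREADS_PER_SM = 2048     # resident threads needed per SM to hide latency
SATURATION = SMS * THREADS_PER_SM   # ~221k elements for 1 thread per element

def utilization(n):
    """Fraction of the GPU's thread capacity a size-n element-wise op can fill."""
    return min(1.0, n / SATURATION)

for n in (1_000, 100_000, 10_000_000):
    print(f"N={n:>10} -> {utilization(n):6.1%} of thread capacity used")
```

Under these assumed numbers, a 1,000-element add occupies well under 1% of the machine, while a 10-million-element add saturates it.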
2. The Python Measurement Trap
Benchmarking GPU code from Python using time.time() is dangerous. GPU calls are asynchronous: Python merely enqueues the command and moves on. Without torch.cuda.synchronize(), you measure only the enqueue time, not the kernel. With synchronization, the measurement includes the launch overhead (host-side dispatch and host-to-device latency), which for small kernels is often 10x longer than the kernel execution itself.
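The trap can be reproduced without a GPU. The sketch below uses a single-worker thread pool as a hypothetical stand-in for a CUDA stream: submitting work returns immediately, so a naive timer sees almost nothing, while a timer that blocks until completion (the analogue of torch.cuda.synchronize()) sees the real cost. The 50 ms "kernel" duration is an arbitrary placeholder.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a CUDA stream: submitted work runs
# asynchronously on another thread, just as kernel launches do on a GPU.
stream = ThreadPoolExecutor(max_workers=1)

def fake_kernel():
    time.sleep(0.05)  # pretend this is 50 ms of GPU work

# Naive timing: measures only how long the *enqueue* takes.
t0 = time.perf_counter()
fut = stream.submit(fake_kernel)  # returns immediately; work is not done yet
naive_ms = (time.perf_counter() - t0) * 1e3
fut.result()  # drain the queue before the next measurement

# Correct timing: block until the work finishes, the analogue of
# calling torch.cuda.synchronize() before reading the clock.
t0 = time.perf_counter()
stream.submit(fake_kernel).result()
synced_ms = (time.perf_counter() - t0) * 1e3

print(f"naive:  {naive_ms:.3f} ms")   # a tiny fraction of the real cost
print(f"synced: {synced_ms:.3f} ms")  # at least the 50 ms of 'kernel' work
```

On real hardware the same pattern applies: either call torch.cuda.synchronize() before reading the clock, or use CUDA events, which time on the device itself.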
3. Latency vs. Throughput
To overcome the paradox, you must provide enough work to "hide" the launch latency. This is the transition from a latency-bound regime (limited by the CPU-GPU bus) to a throughput-bound regime (limited by GPU memory or compute).
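The transition between regimes falls out of a simple cost model: total time is a fixed launch cost plus a data-movement cost that scales with N. The launch overhead (10 µs) and bandwidth (1 TB/s) below are assumed round numbers, not measurements, but the shape of the curve is the point: effective throughput approaches peak only once N makes the fixed cost negligible.

```python
# Hypothetical cost model with assumed constants (illustrative only).
T_LAUNCH = 10e-6      # 10 µs fixed launch overhead per kernel
BANDWIDTH = 1e12      # 1 TB/s effective memory bandwidth

def kernel_time(n_bytes):
    """Total time = fixed launch cost + data-movement cost."""
    return T_LAUNCH + n_bytes / BANDWIDTH

def effective_bandwidth(n_bytes):
    """Throughput actually achieved once the launch tax is paid."""
    return n_bytes / kernel_time(n_bytes)

for n in (1e3, 1e6, 1e9):
    frac = effective_bandwidth(n) / BANDWIDTH
    print(f"{int(n):>12} bytes -> {frac:6.1%} of peak bandwidth")
```

Under these assumptions, a 1 KB transfer achieves a negligible fraction of peak (latency-bound), while a 1 GB transfer achieves ~99% (throughput-bound): the launch latency has been hidden by sheer volume of work.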